Abstract

Dictionary learning and sparse representation (DLSR) is a recent and
successful mathematical model for data representation that achieves
state-of-the-art performance in various fields such as pattern recognition,
machine learning, computer vision, and medical imaging. The original
formulation for DLSR is based on the minimization of the reconstruction
error between the original signal and its sparse representation in the space
of the learned dictionary. Although this formulation is optimal for solving
problems such as denoising, inpainting, and coding, it may not lead to an
optimal solution in classification tasks, where the ultimate goal is to make
the learned dictionary and corresponding sparse representation as
discriminative as possible. This motivated the emergence of a new category
of techniques, appropriately called supervised dictionary learning and
sparse representation (S-DLSR), which lead to dictionaries and sparse
representations that are better suited to classification tasks. Despite many
research efforts on S-DLSR, the literature lacks a comprehensive view of
these techniques, their connections, advantages, and shortcomings. In this
paper, we address this gap and provide a review of the recently proposed
algorithms for S-DLSR. We first present a taxonomy of these algorithms into
six categories based on the approach taken to include label information into
the learning of the dictionary and/or sparse representation. For each
category, we draw connections between the algorithms in this category and
present a unified framework for them. We then provide guidelines for applied
researchers on how to represent and learn the building blocks of an S-DLSR
solution based on the problem at hand. This review provides a broad, yet
deep, view of the state-of-the-art methods for S-DLSR and allows for the
advancement of research and development in this emerging area.

Keywords: dictionary learning, sparse representation, supervised learning, 
classification 


1. Introduction 

There are many mathematical models to describe data with varying degrees
of success, among which dictionary learning and sparse representation
(DLSR) have attracted the interest of many researchers in various fields.
Dictionary learning and sparse representation are two closely related topics
that have roots in the decomposition of signals over some predefined basis,
such as the Fourier transform. Representation of signals using a predefined
basis rests on the assumption that the basis is sufficiently general to
represent any kind of signal. However, recent research shows that learning
the basis¹ from data, instead of using off-the-shelf ones, leads to
state-of-the-art results in many applications such as audio processing [1],
data representation and column selection [2, 3], emotion recognition [4],
face recognition [5–7], image compression [8], denoising [9], and
inpainting [10], image super-resolution [11], medical imaging [12–14],
motion and data segmentation [15, 16], signal classification [17–19], and
texture analysis [20–23]. In fact, what makes DLSR distinct from the
representation using predefined bases is: first, the bases are learned from
the data, and second, only a few components in the dictionary are needed to
represent the data (sparse representation). This latter attribute can also
be seen in the decomposition of signals using some predefined bases such as
wavelets [24].

Although methods for dictionary learning and sparse representation gained
popularity in many domains, their performance is sub-optimal in
classification tasks, as they do not exploit the label information in the
learning of the dictionary atoms and the coefficients of the sparse
approximation. This motivates the emergence of a new category of techniques
that utilize label information in computing either the dictionary, the
coefficients, or both. This branch of DLSR is called supervised dictionary
learning and sparse representation (S-DLSR), and methods for S-DLSR have
shown superior performance in a variety of supervised learning tasks [25–27].

1 Here, the term basis is loosely used as the dictionary can be overcomplete, i.e., the
number of dictionary elements can be larger than the dimensionality of the data, and its
atoms are not necessarily orthogonal and can be linearly dependent.

Despite the several attempts at learning the dictionary and coefficients
in a supervised manner, the literature lacks a comprehensive view of these
methods and their connections. In this paper, we present a review of the
state-of-the-art techniques in S-DLSR, draw connections between methods,
and provide a practical guide for applied researchers in this field on how to
design an S-DLSR algorithm. Specifically, the contributions of this paper are
summarized as follows.

1. The paper proposes a taxonomy of S-DLSR methods into six categories 
based on how the label information is included into the learning of the 
dictionary and/or sparse coefficients. This taxonomy allows the reader 
to understand the landscape of existing methods and how they relate 
to each other. 

2. For the major categories, the paper provides a unified mathematical 
framework for representing the methods in this category. 

3. The paper discusses the advantages and shortcomings of the methods 
in each category and the applications where the usage of these methods 
is preferred. 

4. The paper summarizes the state-of-the-art S-DLSR methods based on
their building blocks (i.e., dictionary, sparse coefficients, and the
classifier parameters) from the learning and representation perspective and
provides guidelines to applied researchers in the field on how to
design these building blocks based on the application at hand.

The comprehensive view of S-DLSR methods presented in this paper will
facilitate further contributions in this interesting and useful area of research,
and allow applied researchers to build efficient and effective solutions
for different applications.

The rest of the paper is organized as follows. Section 2 provides the
background for, and the topics related to, dictionary learning and sparse
representation. In particular, in Subsection 2.3 we present the classical
formulation of DLSR as an unsupervised dictionary learning approach, which
is mainly optimized for applications such as coding and denoising, where
reconstructing the original signals as accurately as possible is the main
concern. In Section 3, the main supervised dictionary learning and sparse
representation (S-DLSR) methods proposed in the literature are reviewed and
categorized depending on how the category information is included into the
learning of the dictionary and/or sparse coefficients. Section 4 provides a
summary of the S-DLSR methods and how to build them based on three building
blocks, i.e., the dictionary learning, sparse representation, and learning
the classifier model. Section 5 concludes the paper.


2. Background 

2.1. Related Topics 

The concept of dictionary learning and sparse representation originated
in different communities attempting to solve different problems, which were
given different names. Some of these problems are: sparse coding (SC), which
was introduced by neurologists as a model for simple cells in the mammalian
primary visual cortex [28, 29]; independent component analysis (ICA), which
was developed by researchers in signal processing to estimate the underlying
hidden components of multivariate statistical data (refer to [30, 31] for a
review of ICA); and the least absolute shrinkage and selection operator
(lasso), which was used by statisticians to find linear regression models
when there are many more predictors than samples, and some constraints have
to be considered to fit the model. In the lasso, one of the constraints
introduced by Tibshirani was the $\ell_1$ norm, which leads to sparse
coefficients in the linear regression model [32]. Another technique that
also leads to DLSR is nonnegative matrix factorization (NNMF), which aims at
decomposing a matrix into two nonnegative matrices, one of which can be
considered to be the dictionary, and the other the coefficients [33]. In
NNMF, usually both the dictionary and coefficients are sparse [34, 35]. This
list is not complete, and there are variants of each of the above
techniques, such as blind source separation (BSS) [36], compressed sensing
[37], basis pursuit (BP) [38], and orthogonal matching pursuit (OMP)
[39, 40]. The reader is referred to [41–43] for some reviews on these
techniques. Figure 1 summarizes the topics related to and the applications
of dictionary learning and sparse representation.

The main result of all these research efforts is that a class of signals
with sparse nature, such as images of natural scenes, can be represented
using some primitive elements that form a dictionary, and each signal in
this class can be represented by using only a few elements in the dictionary,
i.e., by a sparse representation. In fact, there are at least two ways in the
literature to exploit sparsity [23]: first, using a linear/nonlinear combination
of some predefined bases, e.g., wavelets [24]; second, using primitive elements
in a learned dictionary, such as the techniques employed in SC or ICA. This
latter approach is the focus of this paper.

Figure 1: Topics related to and the applications of dictionary learning and
sparse representation.


2.2. Taxonomy of DLSR Methods 

One may categorize the various dictionary learning with sparse
representation approaches proposed in the literature in two ways: one
according to whether the dictionary consists of predefined or learned bases,
as stated above, and the other according to the model used to learn the
dictionary and coefficients. These models can be generative, as used in the
original formulation of SC [28], ICA [30], and NNMF [33]; reconstructive, as
in the lasso [32]; or discriminative, such as SDL-D (supervised dictionary
learning-discriminative) in [25]. The two former approaches do not consider
the class labels in building the dictionary, while the discriminative one
does. In other words, dictionary learning can be performed unsupervised or
supervised, with the difference that in the latter, the class labels in the
training set are used to build a more
discriminative dictionary for the particular classification task at hand.

Table 1: The list of notations and their definitions in this paper.

Notation                     Definition
---------------------------  ------------------------------------------------
$X$                          a finite set of data samples
$X_i$                        the group of data samples in class i
$x_i$                        the i-th data sample
$x_{ij}$                     the j-th data sample in class i
$x_j^i$                      a constituent of signal $x_j$
$x_{ts}$                     a test data sample
$\mathcal{X}$                a random variable representing the data samples
$D$                          dictionary
$D_i$                        the subdictionary learned on class i
$d_i$                        the i-th column of $D$
$d_{ij}$                     the j-th dictionary atom in the subdictionary
                             learned on class i
$\mathcal{D}$                a random variable representing the dictionary
                             atoms
$A$                          sparse coefficients
$A_i$                        part of sparse coefficients corresponding to
                             class i
$A_i^j$                      part of sparse coefficients that represent
                             class i over $D_j$
$\alpha_i$                   the i-th column of $A$
$\alpha_j^i$                 the sparse coefficient corresponding to signal
                             constituent $x_j^i$
$L$, $l$                     loss function
$Y$                          class labels
$\mathcal{Y}$                a random variable representing class labels
$h$                          a histogram
$\mathcal{H}$                a random variable representing histograms
$H$                          centering matrix
$I$                          identity matrix
$K$                          kernel on data
$L$                          kernel on labels
$W$                          classifier parameters to be learned
$U$                          a transformation/projection to be computed
$Q$                          optimal discriminative sparse codes
$\mathcal{Q}(\cdot,\cdot)$   an incoherence term
$S_b$                        between-class covariance matrix
$S_w$                        within-class covariance matrix
$S_\beta$                    a sigmoid function with the slope of $\beta$
$S_i$                        the i-th cluster
$P(\cdot,\cdot)$             joint probability
$I(\cdot,\cdot)$             mutual information shared by two random
                             variables
$R(\cdot)$                   the ratio of intra- to inter-class
                             reconstruction error
$\mathcal{C}(\cdot)$         logistic regression function
$\delta_i$                   a characteristic function that selects the
                             coefficients associated with class i
$r_i(\cdot)$                 the residual error between a data sample and its
                             reconstructed version
$\psi$                       a generic sparsity-inducing function
$\lambda, \lambda_0, \lambda_1, \lambda_2, \eta, \gamma$
                             regularization parameters
$\|\cdot\|_F$                Frobenius norm
$\|\cdot\|_1$                $\ell_1$ norm
$e$                          a vector of all ones
$\mathrm{tr}(\cdot)$         trace operator
$n$                          the number of data samples
$m$                          the number of data samples in a class
$p$                          the dimensionality of data
$c$                          the number of classes
$k$                          the number of dictionary atoms
$k_i$                        the number of dictionary atoms in class i

2.3. Unsupervised Dictionary Learning 

Considering a finite training set of signals²
$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n}$, where p is the
dimensionality and n is the number of data samples, according to classical
dictionary learning and sparse representation (DLSR) techniques (refer to
[41–43] for a recent review on this topic), these signals can be represented
by a linear decomposition over a few dictionary atoms by minimizing a loss
function as given below

$$L(X, D, A) = \sum_{i=1}^{n} l(x_i, D, A), \qquad (1)$$

where L and l are the overall and per data sample loss functions, respectively,
$D \in \mathbb{R}^{p \times k}$ is the dictionary of k atoms, and
$A \in \mathbb{R}^{k \times n}$ are the coefficients.

2 For the convenience of readers, the list of main notations in this review paper is
provided in Table 1.

The loss function can be defined in various ways based on the application
at hand. However, what is common in the DLSR literature is to define the
loss function L as the reconstruction error in a mean-squared sense, with a
sparsity-inducing function $\psi$ as a regularization penalty to ensure the
sparsity of coefficients. Hence, (1) can be written as

$$L(X, D, A) = \min_{D, A} \frac{1}{2} \|X - DA\|_F^2 + \lambda \psi(A), \qquad (2)$$

where subscript F indicates the Frobenius norm³ and $\lambda$ is the
regularization parameter that affects the number of nonzero coefficients.

An intuitive measure of sparsity is the $\ell_0$ norm⁴, which indicates the
number of nonzero elements in a vector. However, the optimization problem
obtained from replacing the sparsity-inducing function $\psi$ in (2) with
$\ell_0$ is non-convex and NP-hard (refer to [42] for a recent comprehensive
discussion on this issue). Two main categories of approximate solutions have
been proposed to overcome this problem: the first is based on greedy
algorithms, such as the well-known orthogonal matching pursuit (OMP)
[251 00, U2]; the second works by approximating the highly discontinuous
$\ell_0$ norm by a continuous function such as the $\ell_1$ norm. This leads
to an approach which is widely known in the literature as lasso [32] or
basis pursuit (BP) [38], and (2) converts to

$$L(X, D, A) = \min_{D, A} \sum_{i=1}^{n} \left( \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \right), \qquad (3)$$

where $x_i$ is the i-th training sample and $\alpha_i$ is the i-th column of A.

3 The Frobenius norm of a matrix X is defined as $\|X\|_F = \sqrt{\sum_{i,j} X_{ij}^2}$.

4 The $\ell_0$ norm of a vector x is defined as $\|x\|_0 = \#\{i : x_i \neq 0\}$.

The reconstructive formulation given in (3) is non-convex when both the
dictionary D and coefficients A are unknown. However, it is convex in each
of them when the other is held fixed, so it is usually solved by alternating
between these two
unknowns. Several fast algorithms have recently been proposed for this pur¬ 
pose, such as K-SVD [IS], online learning m el and cyclic coordinate 
descent [ 18] . 
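The alternating scheme described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not one of the cited algorithms: the sparse coding step uses plain iterative soft-thresholding (ISTA), the dictionary step is an unconstrained least-squares fit followed by column normalization (a common convention that the text does not prescribe), and all function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(X, D, lam, n_iter=200):
    """ISTA for min_A 0.5 * ||X - D A||_F^2 + lam * ||A||_1 with D fixed."""
    step_const = np.linalg.norm(D, 2) ** 2 + 1e-12  # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X)
        A = soft_threshold(A - grad / step_const, lam / step_const)
    return A

def update_dictionary(X, A):
    """Least-squares fit of D with A fixed, then unit-norm columns (assumption)."""
    D = X @ np.linalg.pinv(A)
    norms = np.linalg.norm(D, axis=0)
    norms[norms < 1e-12] = 1.0  # leave unused (all-zero) atoms untouched
    return D / norms

def objective(X, D, A, lam):
    """Value of the objective in (3) for given D and A."""
    return 0.5 * np.linalg.norm(X - D @ A) ** 2 + lam * np.abs(A).sum()

def learn_dictionary(X, k, lam=0.1, n_outer=20, seed=0):
    """Alternate between the two convex subproblems of (3)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], k))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = sparse_code(X, D, lam)   # D fixed, solve for A
        D = update_dictionary(X, A)  # A fixed, solve for D
    A = sparse_code(X, D, lam)       # final codes for the returned dictionary
    return D, A
```

Each subproblem is convex on its own, but the overall procedure only reaches a local solution that depends on the random initialization.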

In (3), the main optimization goal for the computation of the dictionary
and sparse coefficients is minimizing the reconstruction error in the mean- 
squared sense. While this works well in applications where the primary 
goal is to reconstruct signals as accurately as possible, such as in denoising, 
image inpainting, and coding, it is not the ultimate goal in classification tasks 
as discriminating signals is more important here [43]. Recently, there have 
been several attempts to include category information in computing either 
dictionary, coefficients, or both. This branch of DLSR is called supervised 
dictionary learning and sparse representation (S-DLSR). In the following 
section, an overview of proposed S-DLSR approaches in the literature will be 
provided. 
3. Taxonomy of Supervised Dictionary Learning and Sparse Representation
Techniques

In this section, the proposed supervised dictionary learning and sparse 
representation (S-DLSR) approaches in the literature are categorized into 
six different groups, depending on how the class labels are included into the 
learning of the dictionary and/or sparse coefficients. These six categories 
are: 1) learning one dictionary per class, 2) unsupervised dictionary learning 
followed by supervised pruning, 3) joint dictionary and classifier learning, 
4) embedding class labels into the learning of dictionary, 5) embedding class 
labels into the learning of sparse coefficients, and 6) learning a histogram of 
dictionary elements over signal constituents. We note that the taxonomy
proposed in this section is not unique and could be done differently. It is
also worth mentioning that while the first five categories perform S-DLSR
on the whole signal, the last category performs it on signal constituents.
In the rest of this section, the six categories are described and their
advantages and disadvantages are discussed in detail.

3.1. Learning One Dictionary per Class 

The first and simplest approach to include category information in DLSR
is computing one dictionary per class, i.e., using the training samples in
each class to compute part of the dictionary, and then composing all these
partial dictionaries into one. In providing the mathematical formulation for
all the approaches in this category of S-DLSR, it is always assumed that
the training samples are grouped based on the classes they belong to such
that $X = [X_1, X_2, \ldots, X_c] \in \mathbb{R}^{p \times n}$, where c is
the number of classes and
$X_i = [x_{i1}, x_{i2}, \ldots, x_{im}] \in \mathbb{R}^{p \times m}$ is the
group of m training samples in class i. Similarly, the dictionary D is
described as $D = [D_1, D_2, \ldots, D_c] \in \mathbb{R}^{p \times k}$,
where $D_i = [d_{i1}, d_{i2}, \ldots, d_{ik_i}] \in \mathbb{R}^{p \times k_i}$
is the subdictionary of atoms in class i.

Among the methods in this category, the most common ones are: 1) supervised
k-means, 2) sparse representation-based classification (SRC), 3) metaface,
and 4) dictionary learning with structured incoherence (DLSI). These
methods are described in the rest of this subsection.


3.1.1. Supervised k-means 

Perhaps the earliest work in this direction is the one based on the
so-called texton-based approach [20–23, 50–53]. The texton-based approach
can be considered as a dictionary learning approach particularly tailored
for texture analysis. In this approach, textons, which are computed using
the k-means clustering algorithm over patches extracted from texture images,
play the role of dictionary atoms. In a texton-based approach, the texture
images are usually modeled with a histogram of textons, i.e., using a model
of signal constituents, and hence the approach falls mainly into the
category of S-DLSR explained in Subsection 3.6. Nevertheless, the idea of
using k-means and the computed cluster centers as the dictionary elements
can still be considered here as an S-DLSR approach that computes one
dictionary per class. Therefore, a specific name is suggested for this
technique, i.e., supervised k-means, to differentiate it from the
texton-based approach. In supervised k-means, the k-means algorithm is
applied to the training samples in each class, and the k cluster centers
computed are considered to be the dictionary for this class. These partial
dictionaries are eventually composed into one dictionary.

In the mathematical framework, each subdictionary
$D_i = [d_{i1}, d_{i2}, \ldots, d_{ik_i}] \in \mathbb{R}^{p \times k_i}$ can
be computed using the training samples in class i, i.e., using
$X_i = [x_{i1}, x_{i2}, \ldots, x_{im}] \in \mathbb{R}^{p \times m}$ and the
optimization problem

$$\arg\min_{D_i} \sum_{l=1}^{k_i} \sum_{x_{ij} \in S_l} \|x_{ij} - d_{il}\|_2^2, \qquad (4)$$

where $S = \{S_1, S_2, \ldots, S_{k_i}\}$ are the $k_i$ clusters that
partition the data samples $X_i$ in class i. Usually, $k_i$, the number of
dictionary atoms computed per class, is the same over all classes. By
composing all $D_i$ into one dictionary such that
$D = [D_1, D_2, \ldots, D_c] \in \mathbb{R}^{p \times k}$, where
$k = k_i \cdot c$, the whole dictionary is obtained.

One can explain why supervised k-means might be expected to perform better
than an unsupervised one by understanding how k-means computes the cluster
centers: it essentially computes each cluster center by taking the mean of
the points assigned to it. Hence, if k-means were applied to the data
points across classes, a cluster could mix points from several classes, and
the resultant cluster centers would not be identified uniquely with
individual classes. In other words, the cluster centers computed using
k-means across classes would not represent the data samples in a class
properly. Thus, in classification tasks, it is beneficial, particularly at
small dictionary sizes, to use k-means for the data points in one class at
a time [22, Table II].
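The per-class procedure can be sketched in a few lines. The following is a minimal illustration, assuming a basic Lloyd-style k-means with random initialization from the data and the same number of atoms `k_i` for every class; the function names and the `atom_labels` bookkeeping array are hypothetical and are not part of the original description.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm. X is p x m (columns are samples); returns p x k centers."""
    rng = np.random.default_rng(seed)
    centers = X[:, rng.choice(X.shape[1], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances between every sample and every center (m x k).
        d2 = ((X[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)
        assign = d2.argmin(axis=1)
        for l in range(k):
            members = X[:, assign == l]
            if members.shape[1] > 0:  # keep the old center if a cluster is empty
                centers[:, l] = members.mean(axis=1)
    return centers

def supervised_kmeans(X, y, k_i, seed=0):
    """Run k-means separately on each class and concatenate the cluster centers."""
    classes = np.unique(y)
    subdicts = [kmeans(X[:, y == c], k_i, seed=seed) for c in classes]
    D = np.hstack(subdicts)                # D = [D_1, D_2, ..., D_c]
    atom_labels = np.repeat(classes, k_i)  # class that each atom was learned on
    return D, atom_labels
```

Because every atom is a mean of samples from a single class, each column of `D` can be attributed to exactly one class, which is the property motivated in the preceding paragraph.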
3.1.2. Sparse representation-based classification (SRC)

In their seminal work, Wright et al. [6] proposed to use the training
samples as the dictionary in a technique called sparse representation-based
classification (SRC). The approach was proposed for the application of face
recognition and effectively falls into the same category as training one
dictionary per class. However, no actual training is performed here, and all
the training samples are used directly in the dictionary.

To describe SRC more formally, suppose that $x_{ts} \in \mathbb{R}^p$ is a
test sample. The SRC algorithm assigns the whole training set X to the
dictionary D such that $D_i = X_i$ for class i, and computes the sparse
coefficients $\alpha$ for the test sample $x_{ts}$ using the lasso given in
(3) as follows

$$\min_{\alpha} \frac{1}{2} \|x_{ts} - X\alpha\|_2^2 + \lambda \|\alpha\|_1. \qquad (5)$$

In the next step, the residual error is computed for the reconstruction
of the test sample using the training samples of each class and their
corresponding sparse coefficients

$$r_i(x_{ts}) = \|x_{ts} - X\delta_i(\alpha)\|_2^2, \qquad (6)$$

where $\delta_i$ is a characteristic function that selects the coefficients
associated with class i. This residual error is found for each class
separately, and then the class label of the given test sample is assigned
according to

$$\mathrm{label}(x_{ts}) = \arg\min_i r_i(x_{ts}). \qquad (7)$$
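The three steps in (5)-(7) can be sketched as below. This is an illustrative sketch only: the lasso is solved here with plain iterative soft-thresholding (ISTA), which is one of several possible solvers and not necessarily the one used by the original authors, and the function names and the default value of `lam` are assumptions.

```python
import numpy as np

def lasso_ista(D, x, lam, n_iter=500):
    """ISTA for min_a 0.5 * ||x - D a||_2^2 + lam * ||a||_1, as in (5)."""
    step_const = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / step_const
        a = np.sign(z) * np.maximum(np.abs(z) - lam / step_const, 0.0)
    return a

def src_classify(X_train, y_train, x_test, lam=0.01):
    """Classify x_test by the smallest class-wise residual, following (6) and (7)."""
    alpha = lasso_ista(X_train, x_test, lam)  # the training set is the dictionary
    classes = np.unique(y_train)
    residuals = []
    for c in classes:
        delta = np.where(y_train == c, alpha, 0.0)  # delta_i(alpha): keep class-c entries
        residuals.append(np.sum((x_test - X_train @ delta) ** 2))
    residuals = np.array(residuals)
    return classes[int(np.argmin(residuals))], residuals
```

For example, with training columns clustered around two different directions, a test sample close to the first direction is reconstructed almost entirely by the coefficients of the first class, so its residual for that class is the smallest.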
For a low to moderate training set size, this approach is computationally
very efficient, as there is no overhead for learning the dictionary.
Moreover, using the minimum residual error to classify an unseen test
sample is easily interpretable, as the class of the subdictionary leading
to the minimum residual error can be inspected and assigned as the class
label of the test sample. The main disadvantage of this method, however, is
that using the training samples as the dictionary may result in a very
large and possibly inefficient dictionary, due to noisy training instances.
This is particularly the case in applications with large training set
sizes.


3.1.3. Metaface 

To obtain a smaller dictionary, Yang et al. proposed an approach called
metaface, which learns a smaller dictionary for each class and then composes
them into one dictionary [53]. Metaface was originally proposed for the
application of face recognition, but it is general and can be used in any
application. In this approach, each subdictionary $D_i$ is computed using the
training samples $X_i$ in class i using the formulation given in (3) as
follows⁵

$$\min_{D_i, A_i} \frac{1}{2} \|X_i - D_i A_i\|_F^2 + \lambda \|A_i\|_1, \qquad (8)$$

where $A_i \in \mathbb{R}^{k_i \times m}$ is the matrix of sparse coefficients
representing $X_i$.

5 In this paper, whenever the $\ell_1$ norm is used over a matrix, it is meant that the
$\ell_1$ norms over each column of the matrix are summed, as is done in (3). Hence the
correct form for (8) is:
$\min_{D_i, A_i} \sum_{j=1}^{m} \left( \frac{1}{2} \|x_{ij} - D_i \alpha_{ij}\|_2^2 + \lambda \|\alpha_{ij}\|_1 \right)$.
However, forms similar to (8) are loosely used for the $\ell_1$ norm in the rest of this
paper to avoid overly long and complex formulations and to focus more on the concept.
Since this optimization problem is non-convex when both the dictionary and
coefficients are unknown, it has to be solved iteratively and alternately,
with one unknown considered fixed in each alternation. The computed
subdictionaries are eventually composed into one dictionary
$D = [D_1, D_2, \ldots, D_c] \in \mathbb{R}^{p \times k}$. After the
computation of the dictionary, the class label of a test sample $x_{ts}$ is
computed in the same way as explained in the SRC approach, i.e., by finding
the coefficients for this test sample using the computed dictionary instead
of the whole training set in (5), followed by the computation of the
residuals given in (6), and assigning the test sample to the class that
yields the minimal residual.

Although the metaface approach can potentially reduce the size of the 
dictionary compared to the SRC method, its major drawback is that the 
training samples in one class are used for computing the atoms in the cor¬ 
responding subdictionary, irrespective of the training samples from other 
classes. This means that if the training samples across classes have some 
common properties, these shared properties cannot be learned in common in 
the dictionary. 

3.1.4. Dictionary learning with structured incoherence (DLSI)

Ramirez et al. proposed to overcome the aforementioned problem with
the metaface approach by including an incoherence term in (3) to encourage
independence of the dictionaries from different classes, while still
allowing different classes to share features [55].

To enable sharing features among the data points in different classes
when learning the dictionary, instead of learning each $D_i$ independently
and unaware of the data points in other classes, an incoherence term is
added to the lasso as described by the formulation below

$$\min_{\{D_i, A_i\}_{i=1}^{c}} \sum_{i=1}^{c} \left\{ \|X_i - D_i A_i\|_F^2 + \lambda \|A_i\|_1 \right\} + \eta \sum_{i \neq j} \|D_i^T D_j\|_F^2, \qquad (9)$$

where the last term is an incoherence term⁶ $\mathcal{Q}(D_i, D_j)$, which
has been proposed in [55] to be the inner product between the two
subdictionaries $D_i$ and $D_j$, but it could be defined differently as long
as it includes some measure of (dis)similarity/(in)coherence. In fact, the
incoherence term in (9) discourages similarity among the subdictionaries
learned across different classes. After finding the dictionary, the
classification of a test sample is performed the same way as with the SRC.
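To make the role of the incoherence term concrete, the snippet below evaluates it, together with the full objective in (9), for given subdictionaries. It is only a sketch under assumptions: the sum is taken over ordered pairs with i ≠ j (so each unordered pair is counted twice), `eta` denotes the weight of the incoherence term, and the function names are hypothetical.

```python
import numpy as np

def incoherence(subdicts):
    """Sum over i != j of ||D_i^T D_j||_F^2 for a list of p x k_i subdictionaries."""
    total = 0.0
    for i, D_i in enumerate(subdicts):
        for j, D_j in enumerate(subdicts):
            if i != j:
                total += np.linalg.norm(D_i.T @ D_j) ** 2
    return total

def dlsi_objective(X_list, subdicts, A_list, lam, eta):
    """Value of the objective in (9) for given per-class data, dictionaries, and codes."""
    fit = sum(
        np.linalg.norm(X_i - D_i @ A_i) ** 2 + lam * np.abs(A_i).sum()
        for X_i, D_i, A_i in zip(X_list, subdicts, A_list)
    )
    return fit + eta * incoherence(subdicts)
```

Subdictionaries with mutually orthogonal atoms incur no penalty, whereas identical subdictionaries are penalized most heavily, which is how the term pushes class-specific atoms apart.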

Discussion. The advantage of the S-DLSR methods in this category is mainly
the ease of the computation of the dictionary. In the case of the SRC
method, no learning is needed for the dictionary, as the dictionary is the
same as the training samples. However, the main drawback of all the
approaches in this category is that they may lead to a very large
dictionary, as the size of the composed dictionary grows linearly with the
number of classes. An example is face recognition, where there are many
classes. For instance, in the Extended Yale B database [56], there are 38
classes, and learning even 10 atoms per class (in SRC, all data instances in
the training set are included in the dictionary) can easily lead to a large
dictionary.

6 Please note that the last term in (9) is an inner product and hence a measure of
similarity/coherence. However, since this term is minimized in the optimization
problem, it is called an incoherence term by the authors of the original paper, a
convention which is also adopted here.
3.2. Unsupervised Dictionary Learning Followed by Supervised Pruning

The second category of S-DLSR approaches first learns a very large
dictionary in an unsupervised manner, and then merges the atoms in the
dictionary by optimizing an objective function that takes into account the
category information. The two main methods in this category are: 1) an
approach based on information bottleneck (IB), and 2) universal visual
dictionary (UVD). The details of these methods are as follows.

3.2.1. Information bottleneck (IB) 

One major work in the literature in this direction is based on
agglomerative information bottleneck (AIB), which iteratively merges the two
dictionary atoms that cause the smallest decrease in the mutual information
between the dictionary atoms and the class labels [57]. The discriminative
power of a dictionary D is characterized by the AIB as the amount of mutual
information $I(\mathcal{D}, \mathcal{Y})$ shared by the random variables
$\mathcal{D}$ (dictionary atoms) and $\mathcal{Y}$ (category information):

$$I(\mathcal{D}, \mathcal{Y}) = \sum_{d \in \mathcal{D}} \sum_{y \in \mathcal{Y}} P(d, y) \log \frac{P(d, y)}{P(d) P(y)}, \qquad (10)$$

where the joint probability $P(d, y)$ is estimated from the data by counting
the number of occurrences of dictionary atoms d in each category
$y \in \{1, \ldots, c\}$. The mutual information $I(\mathcal{D}, \mathcal{Y})$
decreases monotonically as the AIB iteratively compresses the dictionary by
merging the dictionary atoms for which the smallest decrease in the mutual
information (discriminating power) occurs. This is continued until a
predefined dictionary size is obtained. Although the approach is slow, a
solution called "Fast AIB" has been proposed in [57] to make it
computationally efficient.
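The greedy merging can be illustrated with a naive sketch that recomputes (10) for every candidate pair. It assumes that the input is a hypothetical `counts` matrix whose entry (d, y) is the number of occurrences of atom d in category y; it does not reproduce the "Fast AIB" speed-ups, and the function names are illustrative.

```python
import numpy as np

def mutual_information(counts):
    """I(D; Y) as in (10), with P(d, y) estimated from a k x c count matrix."""
    P = counts / counts.sum()
    Pd = P.sum(axis=1, keepdims=True)  # marginal over atoms
    Py = P.sum(axis=0, keepdims=True)  # marginal over classes
    mask = P > 0                       # treat 0 * log(0) as 0
    return float((P[mask] * np.log(P[mask] / (Pd @ Py)[mask])).sum())

def merge_once(counts):
    """Merge the pair of atoms whose merging loses the least mutual information."""
    base = mutual_information(counts)
    best = None
    k = counts.shape[0]
    for a in range(k):
        for b in range(a + 1, k):
            merged = np.delete(counts, b, axis=0)  # drop atom b ...
            merged[a] = counts[a] + counts[b]      # ... and add its counts to atom a
            loss = base - mutual_information(merged)
            if best is None or loss < best[0]:
                best = (loss, merged)
    return best[1]

def aib(counts, target_size):
    """Repeat the greedy merge until the dictionary has target_size atoms."""
    counts = np.asarray(counts, dtype=float)
    while counts.shape[0] > target_size:
        counts = merge_once(counts)
    return counts
```

In this naive form every merge step examines all pairs of atoms, which is consistent with the remark above that the basic approach is slow for large initial dictionaries.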


3.2.2. Universal visual dictionary (UVD) 

Another major work is based on merging two dictionary atoms so as to
minimize the loss of mutual information between the histogram of dictionary
atoms over signal constituents, e.g., image patches, and class labels [58].
From this point of view, the difference between this approach and the one
based on AIB is in the way they measure the discriminative power of the
dictionary. In this approach, rather than measuring the discriminative power
of the dictionary on individual dictionary atoms, it is measured on the
histogram of dictionary atoms h over signal constituents. Therefore,
$I(\mathcal{H}, \mathcal{Y})$, where $\mathcal{H}$ is the random variable
over the histograms h, is considered in UVD, instead of
$I(\mathcal{D}, \mathcal{Y})$ used by AIB. However, since the dimensionality
of histograms tends to be very high, the estimation of
$I(\mathcal{H}, \mathcal{Y})$ is only possible with strong assumptions on
the histograms. In [58], it is assumed that histograms can be modeled using
a mixture of Gaussians, with one Gaussian per category. Based on this
assumption, in [58], the category posterior probability $p(y|h)$ is used
instead of the mutual information $I(\mathcal{H}, \mathcal{Y})$ for
characterizing the discriminative power of the dictionary. Since this
approach works on a histogram of dictionary atoms over signal constituents,
it can also be categorized in the sixth category of S-DLSR explained in
Subsection 3.6.


Discussion. One main drawback of this category of S-DLSR is that the
reduced dictionary performs, at best, as well as the original one. Since
the initial dictionary is learned in an unsupervised manner, even though its
large size means that it includes almost all possible atoms that help to
improve the performance of the classification task [59–61], the subsequent
pruning stage is inefficient in terms of computational load. This might be
one of the reasons that this category has attracted less attention than
other S-DLSR approaches in the literature, as the efficiency of the method
can be significantly improved by finding a discriminative dictionary from
the beginning.

3.3. Joint Dictionary and Classifier Learning 

The third category of S-DLSR, which is based on several research works
published in [25, 26, 62-65], can be considered a major leap in the field. In
this category, the classifier parameters and the dictionary are learned in a
joint optimization problem. The main methods in this category are: 1) supervised
dictionary learning-discriminative (SDL-D), 2) discriminative K-SVD
(DK-SVD), 3) label consistent K-SVD (LCK-SVD), and 4) Bayesian supervised
dictionary learning, which are described in the following subsections.

3.3.1. Supervised dictionary learning-discriminative (SDL-D) 

Mairal et al. were among the first research teams to propose a joint
optimization problem for learning the dictionary and the classifier
parameters [25, 26, 62]. In [25], they proposed the following formulation

$$\min_{D,W,A} \sum_{i=1}^{n} \Big( \mathcal{C}\big(y_i f(x_i, \alpha_i, W)\big) + \lambda_0 \|x_i - D\alpha_i\|_2^2 + \lambda_1 \|\alpha_i\|_1 \Big) + \lambda_2 \|W\|_F^2, \tag{11}$$

where $\mathcal{C}(x) = \log(1 + e^{-x})$ is the logistic loss function,
$\{y_i \in \{-1, +1\}\}_{i=1}^{n}$ are binary class labels,^7 $f(\cdot)$ is the
classifier function, and $W$ is the associated set of classifier parameters to
be learned. In (11), $\lambda_0$ is the parameter that controls
the relative importance of the reconstruction error and the loss function on
the classifier, $\lambda_1$ is the regularization parameter that controls the
level of sparsity of the coefficients, and $\lambda_2$ is the regularization
parameter that prevents overfitting of the classifier. The actual
discriminative formulation proposed in [25] is considerably more complex than
(11) and its description is not provided here. The optimization problem in
(11) is non-convex and has many parameters to tune, which makes the approach
computationally expensive.
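To make the roles of the four terms in (11) concrete, the following minimal sketch evaluates the objective for given D, A, and classifier parameters. It assumes a linear model f(x, alpha, W) = w^T alpha + b, which is only one possible choice of classifier function; the function names `logistic_loss` and `sdl_objective` are illustrative and do not come from [25].

```python
import numpy as np

def logistic_loss(z):
    """C(z) = log(1 + exp(-z)), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, -z)

def sdl_objective(X, y, D, A, w, b, lam0, lam1, lam2):
    """Value of (11), assuming a linear model f(x, alpha, W) = w^T alpha + b.

    X: (p, n) signals, y: (n,) labels in {-1, +1}, D: (p, k) dictionary,
    A: (k, n) sparse codes, w: (k,) classifier weights, b: scalar bias.
    """
    scores = w @ A + b                           # f evaluated on every sample
    loss = logistic_loss(y * scores).sum()       # classification term
    recon = lam0 * np.sum((X - D @ A) ** 2)      # reconstruction term
    sparsity = lam1 * np.abs(A).sum()            # l1 penalty on the codes
    reg = lam2 * np.sum(w ** 2)                  # classifier regularizer
    return loss + recon + sparsity + reg
```

The sketch only evaluates the cost; minimizing it jointly over D, W, and A is the non-convex part discussed above.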

3.3.2. Discriminative K-SVD (DK-SVD) 

In [63], Zhang and Li proposed a technique called discriminative K-SVD
(DK-SVD). DK-SVD learns the classifier parameters and the dictionary truly
jointly, without alternating between these two steps. This reduces the risk
of the solution getting stuck in local minima. However, only linear
classifiers are considered in DK-SVD, which may lead to poor performance
in difficult classification tasks.

To provide the formulation for DK-SVD, one may notice that after learning
the dictionary using the lasso (3), a linear classifier is to be learned on
the coefficients $A$ in the space of the learned dictionary. Suppose that
$W \in \mathbb{R}^{c \times k}$ is the matrix of classifier parameters ($c$ is
the total number of classes and $k$ is the number of dictionary atoms), and
$Y \in \mathbb{R}^{c \times n}$ includes the class labels ($n$ is the number of
training samples) such that each column of $Y$ is
$y_i = [0, \ldots, 1, \ldots, 0]^T$, i.e., there is exactly one nonzero element
in each column of $Y$, whose position indicates the class of the corresponding
training sample. The classifier can be learned in the least-squares sense by
minimizing the classification error using the optimization problem

$$\min_{W} \frac{1}{2} \|Y - WA\|_F^2. \tag{12}$$

This optimization problem can be combined with the lasso (3) into one
optimization problem

$$\min_{D,W,A} \frac{1}{2} \|X - DA\|_F^2 + \frac{\gamma}{2} \|Y - WA\|_F^2 + \lambda \|A\|_1, \tag{13}$$

where $\gamma$ weights the classification term relative to the reconstruction
term.

7 The approach can be easily extended to multiclass problems.


To find the dictionary, coefficients, and the classifier, the optimization
problem given in (13) has to be solved iteratively and alternately, with two of
these unknowns fixed each time and solving for the third. This makes the
solution slow and very likely to get stuck in some local minima. To partially
overcome these problems, it is proposed in [63] to combine the first two terms
in (13) into one term as follows

$$\min_{D,W,A} \frac{1}{2} \left\| \begin{bmatrix} X \\ \sqrt{\gamma}\, Y \end{bmatrix} - \begin{bmatrix} D \\ \sqrt{\gamma}\, W \end{bmatrix} A \right\|_F^2 + \lambda \|A\|_1. \tag{14}$$

Considering $[X^T, \sqrt{\gamma}\, Y^T]^T$ as a new training set
$X_N \in \mathbb{R}^{(p+c) \times n}$ and $[D^T, \sqrt{\gamma}\, W^T]^T$
as a new dictionary $D_N \in \mathbb{R}^{(p+c) \times k}$, (14) is converted to
the lasso

$$\min_{D_N, A} \frac{1}{2} \|X_N - D_N A\|_F^2 + \lambda \|A\|_1, \tag{15}$$

and can efficiently be solved by one of the recently developed fast algorithms
for this purpose such as K-SVD [35]. Deriving $D$ and $W$ from $D_N$ is
straightforward and the details are provided in [63].
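The stacking trick in (14) and (15) can be illustrated with a short NumPy sketch. It builds the augmented training set from X and Y and splits an augmented dictionary back into D and W. The function names and the optional per-atom renormalization step are assumptions made for illustration; the exact recovery procedure should be taken from [63].

```python
import numpy as np

def stack_training_set(X, Y, gamma):
    """Build X_N = [X; sqrt(gamma) Y] of shape (p + c, n)."""
    return np.vstack([X, np.sqrt(gamma) * Y])

def split_dictionary(D_N, p, gamma, renormalize=True):
    """Split D_N = [D; sqrt(gamma) W] back into D (p, k) and W (c, k)."""
    D = D_N[:p, :].copy()
    W = D_N[p:, :] / np.sqrt(gamma)
    if renormalize:
        # Assumption: rescale so that the atoms of D have unit l2 norm and
        # apply the same per-atom scale to the columns of W.
        norms = np.linalg.norm(D, axis=0)
        norms[norms == 0] = 1.0
        D, W = D / norms, W / norms
    return D, W
```

Because the squared Frobenius norm of a stacked matrix is the sum of the squared norms of its blocks, the stacked residual equals the reconstruction error plus gamma times the classification error, which is why a single dictionary-learning solver can handle both terms at once.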


3.3.3. Label consistent K-SVD (LCK-SVD) 

Inspired by DK-SVD as described in the previous subsection, Jiang et
al. [66] proposed label consistent K-SVD (LCK-SVD). In DK-SVD, although
the linear classifier $W$ and dictionary $D$ are learned in one optimization
problem, there is no mechanism to ensure that the learned dictionary is
discriminative. To overcome this problem, it is suggested in [66] to enforce
a label consistency constraint on the dictionary by adding one additional
term to the optimization problem of DK-SVD given in (13). The LCK-SVD
optimization problem is, therefore, as follows:

$$\min_{D,U,W,A} \frac{1}{2} \|X - DA\|_F^2 + \frac{\eta}{2} \|Q - UA\|_F^2 + \frac{\gamma}{2} \|Y - WA\|_F^2 + \lambda \|A\|_1, \tag{16}$$

where the added second term enforces the label consistency on the dictionary.
In other words, the second term in (16) enforces the coefficients $A$ to be as
similar as possible to the optimal discriminative sparse codes in $Q$. In (16),
$Q \in \mathbb{R}^{k \times n}$ encodes the optimal discriminative sparse
coefficients, $U \in \mathbb{R}^{k \times k}$ is a linear transformation matrix,
and $\eta$ is a parameter that controls the relative contribution of the label
consistency term. Each column of $Q$ is
$q_i = [0, \ldots, 1, 1, \ldots, 0]^T$, where the locations of the ones
correspond to the optimal nonzero sparse coefficients representing a data
sample $x_i$. For example, if both $X$ and $D$ consist of six columns (six data
samples and six dictionary atoms), such that there are two samples and two
atoms per class in a three-class problem, $Q$ has to be defined as:

$$Q = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{bmatrix}. \tag{17}$$
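The matrix in (17) can be generated mechanically once every dictionary atom and every training sample carries a class label. The sketch below assumes that atoms are assigned to classes beforehand; the function name `label_consistency_matrix` is illustrative only.

```python
import numpy as np

def label_consistency_matrix(atom_labels, sample_labels):
    """Q[j, i] = 1 if atom j and sample i share the same class, else 0."""
    atom_labels = np.asarray(atom_labels)
    sample_labels = np.asarray(sample_labels)
    # Broadcasting compares every atom label with every sample label.
    return (atom_labels[:, None] == sample_labels[None, :]).astype(float)

# Six atoms and six samples, two per class, as in the example above.
Q = label_consistency_matrix([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
```

With the labels sorted by class, Q is block diagonal, which matches the structure shown in (17).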


Similar to DK-SVD, the first three terms in (16) can be combined into
one term as follows:

$$\min_{D,A,W,U} \frac{1}{2} \left\| \begin{bmatrix} X \\ \sqrt{\eta}\, Q \\ \sqrt{\gamma}\, Y \end{bmatrix} - \begin{bmatrix} D \\ \sqrt{\eta}\, U \\ \sqrt{\gamma}\, W \end{bmatrix} A \right\|_F^2 + \lambda \|A\|_1. \tag{18}$$

Let $[X^T, \sqrt{\eta}\, Q^T, \sqrt{\gamma}\, Y^T]^T$ be a new training set
$X_N \in \mathbb{R}^{(p+k+c) \times n}$ and
$[D^T, \sqrt{\eta}\, U^T, \sqrt{\gamma}\, W^T]^T$ be a new dictionary
$D_N \in \mathbb{R}^{(p+k+c) \times k}$. Then (18) is converted to the form
given in (15), which can again be efficiently solved by one of the recently
developed fast algorithms for this purpose such as K-SVD. Subsequently, $D$,
$U$, and $W$ can be easily derived from $D_N$.

3.3.4. Bayesian supervised dictionary learning

Dictionary learning based on Bayesian models was first proposed by Zhou
et al. |SZl EH]. However, the method did not take into account the class
labels in learning the dictionary and hence was not optimal for classification
tasks. To overcome this problem, a non-parametric Bayesian technique has
recently been proposed to jointly learn the dictionary, classifier, and
sparse coefficients using a beta-Bernoulli process [55].

Discussion. The idea used in this category of S-DLSR is more sophisticated
than the previous two. However, the major disadvantage, especially with the
first approach in this category, i.e., SDL-D, is that the optimization problem
is non-convex and complex. If the optimization is performed alternately
between learning the dictionary and the classifier parameters, it is quite
likely to become stuck in some local minima. On the other hand, due to the
complexity of the optimization problem, only linear classifiers are considered
in this category (except for the bilinear classifier in [25]). These are
usually too simple to solve difficult classification tasks and can only be
successful in simple ones, as shown in [25]. Another major problem with the
approaches in this category of S-DLSR is that there exist many parameters
involved in the formulation, which are hard and time-consuming to tune (see,
for example, [25, 26]).

3.4. Embedding Class Labels into the Learning of Dictionary

The fourth category of S-DLSR approaches includes the category
information in the learning of the dictionary. Among the approaches in this
category, Gangeh et al. [27] and Zhang et al. [70] have proposed to learn the
dictionary and sparse coefficients in a more discriminative (in some sense)
projected space, whereas Lazebnik and Raginsky [71] included the category
information in the learning of the dictionary by minimizing the information
loss due to predicting the class labels in the space of the learned dictionary
instead of the original space. The details of these methods are as follows.


3.4.1. HSIC-based supervised dictionary learning

Recently, Gangeh et al. [27] proposed an S-DLSR method based on the
Hilbert-Schmidt independence criterion (HSIC). HSIC is a kernel-based
independence measure between two random variables $\mathcal{X}$ and
$\mathcal{Y}$ [72]. It computes the Hilbert-Schmidt norm of the
cross-covariance operators in reproducing kernel Hilbert spaces (RKHSs)
[72, 73].

In practice, HSIC is estimated using a finite number of data samples. Let
$Z := \{(x_1, y_1), \ldots, (x_n, y_n)\} \subseteq \mathcal{X} \times \mathcal{Y}$
be $n$ independent observations drawn from $p := P_{\mathcal{X} \times \mathcal{Y}}$.
The empirical estimate of HSIC can be computed using [72]

$$\mathrm{HSIC}(Z) = \frac{1}{(n-1)^2}\, \mathrm{tr}(KHLH), \tag{19}$$

where tr is the trace operator, $H, K, L \in \mathbb{R}^{n \times n}$,
$K_{ij} = k(x_i, x_j)$, $L_{ij} = l(y_i, y_j)$, and $H = I - n^{-1} e e^T$
($I$ is the identity matrix and $e$ is a vector of $n$ ones; hence, $H$ is the
centering matrix). According to (19), maximizing the empirical estimate of
HSIC, i.e., $\mathrm{tr}(KHLH)$, will lead to the maximization of the
dependency between the two random variables $\mathcal{X}$ and $\mathcal{Y}$.
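The empirical estimate in (19) is a few lines of linear algebra once the two kernel matrices are available. The following sketch takes precomputed kernel matrices K and L; the function name `empirical_hsic` is illustrative, and the choice of kernels is left to the caller.

```python
import numpy as np

def empirical_hsic(K, L):
    """Empirical HSIC of (19): tr(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix I - ee^T / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

As a quick sanity check on the formula, if all samples share the same label under a kernel that returns 1 for equal labels, L is a matrix of ones, HLH vanishes, and the estimate is zero: labels that never vary cannot depend on the data.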

The HSIC-based S-DLSR learns the dictionary in a space where the
dependency between the data and corresponding class labels is maximized. To
this end, it has been proposed in [27] to solve the following optimization
problem

$$\max_{U}\ \mathrm{tr}(U^T X H L H X^T U) \quad \text{s.t.} \quad U^T U = I, \tag{20}$$

where $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n}$ contains $n$ data
samples of dimensionality $p$; $H$ is the centering matrix, whose function is
to center the data, i.e., to remove the mean from the features; $L$ is a kernel
on the labels $Y$; and $U$ is the transformation that maps the data to the
space of maximum dependency with the labels. According to the Rayleigh-Ritz
Theorem [73], the solution of (20) is given by the top eigenvectors of
$\Phi = X H L H X^T$ corresponding to its largest eigenvalues.


To explain how the optimization problem provided in (20) learns the
dictionary in the space of maximum dependency with the labels, we note that,
after a few manipulations, the objective function given in (20) has the form
of the empirical HSIC given in (19), i.e.,

$$\begin{aligned} \max_{U}\ \mathrm{tr}(U^T X H L H X^T U) &= \max_{U}\ \mathrm{tr}(X^T U U^T X H L H) \\ &= \max_{U}\ \mathrm{tr}\big(\big[(U^T X)^T U^T X\big] H L H\big) \\ &= \max_{U}\ \mathrm{tr}(K H L H), \end{aligned} \tag{21}$$

where $K = (U^T X)^T U^T X$ is a linear kernel on the transformed data in the
subspace $U^T X$. To derive (21), it is noted that the trace operator is
invariant under cyclic permutation.

Now, it is easy to observe that the form given in (21) is the same as the
empirical HSIC in (19) up to a constant factor and therefore, it can be easily
interpreted as transforming centered data $X$ using the transformation $U$ to a
space where the dependency between the data and class labels is maximized.
In other words, the computed transformation $U$ constructs the dictionary
learned in the space of maximum dependency between the data and class
labels.

After finding the dictionary $D = U$, the sparse coefficients can be
computed using the formulation given in (3) [27].

One main advantage of the HSIC-based S-DLSR is that both the dictionary
and the sparse coefficients can be computed in closed form [27], which makes
the approach computationally very efficient. Another main advantage of the
approach is that it can be easily kernelized and therefore, by embedding an
appropriate kernel into the solution, subtle classification tasks can be solved
with high accuracy. The approach, however, does not allow overcomplete
dictionaries due to the orthogonality constraint imposed on the
transformation. This might be of little concern as it has been shown that the
method works comparably well at small dictionary sizes.
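The two closed-form steps of the HSIC-based approach can be sketched as follows: an eigendecomposition for the dictionary, followed by soft-thresholding for the coefficients. This is an illustrative sketch rather than the implementation of [27]. It assumes a delta kernel on the labels (1 for equal labels, 0 otherwise) and a lasso objective with a factor of 1/2 on the reconstruction term, so that the threshold equals lambda; the function names are hypothetical.

```python
import numpy as np

def hsic_dictionary(X, labels, k):
    """Top-k eigenvectors of Phi = X H L H X^T, used as the dictionary U."""
    n = X.shape[1]
    labels = np.asarray(labels)
    H = np.eye(n) - np.ones((n, n)) / n
    # Assumption: delta kernel on the labels.
    L = (labels[:, None] == labels[None, :]).astype(float)
    Phi = X @ H @ L @ H @ X.T
    Phi = (Phi + Phi.T) / 2                  # remove round-off asymmetry
    _, eigvecs = np.linalg.eigh(Phi)         # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]           # columns for the k largest

def soft_threshold_codes(X, U, lam):
    """Closed-form lasso codes when U has orthonormal columns."""
    Z = U.T @ X
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)
```

Since the columns of U are orthonormal, k cannot exceed p, which is the reason overcomplete dictionaries are excluded, as noted above.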


3.4.2. Discriminative projection and dictionary learning

Along the same lines as HSIC-based S-DLSR, Zhang et al. [70] also proposed
to learn the dictionary and the sparse representation in a more
discriminative space (in a sense defined below). To this end, they propose to
first project the data to an orthogonal space where the intra-class
reconstruction error is minimized and the inter-class reconstruction error is
maximized, and subsequently learn the dictionary and the sparse representation
of the data in this space. The intra-class reconstruction error for a data
sample $x_i$ is defined as the reconstruction error using the dictionary atoms
in the ground-truth class of $x_i$ under the metric $UU^T$ ($U$ is the
projection to be learned), whereas the inter-class error is defined as the
reconstruction error using the dictionary atoms other than those of the
ground-truth class of $x_i$ under the same metric.

To provide the mathematical formulation, given a training set
$X \in \mathbb{R}^{p \times n}$, the task is to learn a discriminative
transformation/projection $U \in \mathbb{R}^{p \times m}$, where $m < p$ is the
number of basis vectors, and a dictionary $D \in \mathbb{R}^{p \times k}$ using
the optimization problem given below

$$\min_{U,D}\ \frac{1}{n} \sum_{i=1}^{n} \big( S_\beta(R(x_i)) + \lambda \|\alpha_i\|_1 \big) \quad \text{s.t.} \quad U^T U = I, \tag{22}$$

where $S_\beta(x) = \frac{1}{1 + e^{\beta(1 - x)}}$ is a sigmoid function
centered at 1 with slope $\beta$, and $R(x_i)$ is the ratio of intra- to
inter-class reconstruction errors. $S_\beta(R(x_i))$ can be intuitively
considered as the inverse classification confidence and, by minimizing this
term over the training samples in the objective function of (22), the
discriminative projection $U$ and dictionary $D$ are empirically learned
subject to a sparsity constraint imposed as the second term in (22).
In (22), $\alpha_i$ is the sparse representation of the projected data sample
$U^T x_i$ in the space of the dictionary learned in the projected space
$U^T D$, i.e.,

$$\alpha_i = \arg\min_{\alpha}\ \|U^T x_i - U^T D \alpha\|_2^2 + \lambda \|\alpha\|_1. \tag{23}$$

The optimization problem given in (22) and (23) has to be solved
alternately between sparse coding (using (23) with $U$ and $D$ fixed) and
learning the dictionary and projected space (using (22) with fixed sparse
coefficients $A$). This optimization problem is non-convex and the projection
and dictionary have to be learned iteratively and alternately using gradient
descent. Therefore, unlike HSIC-based S-DLSR, there exist no closed-form
solutions here and the algorithm may get stuck in some local minima.
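The per-sample quantity minimized in (22) can be illustrated with a small sketch. It computes the ratio of intra- to inter-class reconstruction errors in the projected space and passes it through the sigmoid. The way the two errors are formed here (restricting a given code to the atoms of the true class, or to all remaining atoms) is an interpretation of the verbal definition above, not the exact procedure of Zhang et al.; the function names and the small constant `eps` are assumptions.

```python
import numpy as np

def sigmoid_beta(r, beta):
    """S_beta(r) = 1 / (1 + exp(beta * (1 - r))), centered at r = 1."""
    return 1.0 / (1.0 + np.exp(beta * (1.0 - r)))

def error_ratio(x, alpha, D, U, atom_labels, y, eps=1e-12):
    """Ratio of intra- to inter-class reconstruction error under U U^T."""
    atom_labels = np.asarray(atom_labels)
    same = atom_labels == y                          # atoms of the true class
    intra = np.sum((U.T @ (x - D[:, same] @ alpha[same])) ** 2)
    inter = np.sum((U.T @ (x - D[:, ~same] @ alpha[~same])) ** 2)
    return intra / (inter + eps)                     # eps avoids division by 0
```

A ratio well below 1 means the true-class atoms explain the sample much better than the others, so the sigmoid value, read as an inverse confidence, is close to 0.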


3.4.3. Information loss minimization (info-loss)

Lazebnik and Raginsky proposed in [71] to include category information
in the learning of the dictionary by minimizing the information loss due
to predicting labels from a learned supervised dictionary instead of the
original training data samples. This approach is known as info-loss in the
S-DLSR literature. In fact, in S-DLSR, the ultimate goal is to represent the
original high-dimensional feature space by a dictionary such that it
facilitates the correct prediction of the class labels. Ideally, the
dictionary should maintain all the discriminative power of the original
feature space. However, some of this information is lost during the
quantization of the feature space. In [71], it has been proposed to learn the
dictionary such that the information loss

$$I(\mathcal{X}, \mathcal{Y}) - I(\mathcal{D}, \mathcal{Y}) \tag{24}$$

is minimized, where $I$ indicates the mutual information between its
arguments as random variables, and $\mathcal{X}$, $\mathcal{D}$, and
$\mathcal{Y}$ are the random variables on the original feature space $X$, the
learned dictionary $D$, and the class labels $Y$, respectively.
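For intuition, the information loss in (24) can be computed exactly in a toy discrete setting, where the feature space takes finitely many values and the dictionary is a hard assignment of feature values to codewords. This discretization and the function names are assumptions made for illustration; the method of Lazebnik and Raginsky works with continuous features.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)    # marginal over rows
    py = joint.sum(axis=0, keepdims=True)    # marginal over columns
    mask = joint > 0                         # 0 * log 0 is treated as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def information_loss(joint_xy, assignment, k):
    """I(X, Y) - I(D, Y) when feature value i is mapped to codeword assignment[i]."""
    joint_xy = np.asarray(joint_xy, dtype=float)
    joint_dy = np.zeros((k, joint_xy.shape[1]))
    np.add.at(joint_dy, assignment, joint_xy)   # merge rows sharing a codeword
    return mutual_information(joint_xy) - mutual_information(joint_dy)
```

Merging feature values that always occur with the same class loses nothing, whereas merging values from different classes can destroy all label information; a supervised dictionary aims for the former kind of quantization.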

As in the previous category of S-DLSR, the info-loss approach has the major
drawback that it may become stuck in local minima. This is mainly because the
optimization has to be done iteratively and alternately on two updates, as
there is no closed-form solution for the approach.

3.4.4. Randomized clustering forests (RCF)

In [61], it is proposed to learn the dictionary atoms using extremely
randomized decision trees. This approach also falls into the second category
of S-DLSR, as it appears to start from a very large dictionary built using
random forests and to prune it later to arrive at a smaller dictionary.

Discussion. The idea of learning the dictionary and sparse coefficients in a
more discriminative projected space, introduced by the first two approaches
in the category, i.e., HSIC-based S-DLSR and discriminative projection and
dictionary learning, opens a very promising avenue of research in the field of
S-DLSR. Based on these two methods, the projection to a discriminative space
can be defined in different ways depending on criteria related to the
problem at hand. If the projection/dictionary is defined to be orthonormal,
the learning of the coefficients can be performed in closed form mm using
soft-thresholding US. With a careful selection of the discriminative
criterion, it might also be possible to find a closed-form solution for the
dictionary, such as the one found in HSIC-based S-DLSR, which can further
improve the performance of the approach in terms of computation time.

3.5. Embedding Class Labels into the Learning of Sparse Coefficients 

The fifth category of S-DLSR includes class category in the learning of
coefficients [79] or in the learning of both dictionary and coefficients
[7, 77]. Supervised coefficient learning in all these papers [7, 79, 77] has
been performed more or less in the same way using the Fisher discrimination
criterion [78], i.e., by minimizing the within-class covariance of coefficients
and at the same time maximizing their between-class covariance. As for the
dictionary, while Huang et al. [79] have used predefined bases by deploying an
overcomplete dictionary as a combination of Haar and Gabor bases, Yang et
al. [7] have proposed a discriminative fidelity term to learn the dictionary,
for which further description is provided below, along with the learning of
the coefficients.

3.5.1. Fisher discrimination dictionary learning (FDDL) 

In [7], an approach called Fisher discrimination dictionary learning (FDDL)
has been proposed that uses category information in learning both the
dictionary and the sparse coefficients. To learn the dictionary in a
supervised manner, a discriminative fidelity term has been proposed that
encourages learning dictionary atoms of one class from the training samples of
the same class, and at the same time penalizes their learning from the
training samples of other classes. As stated above, the coefficients are also
learned in a supervised manner, by including the Fisher discriminant criterion
in their learning.

To provide a mathematical formulation for FDDL, suppose that the training
samples are grouped according to the classes they belong to, i.e.,
$X = [X_1, X_2, \ldots, X_c] \in \mathbb{R}^{p \times n}$, where $c$ is the
number of classes. In addition to the sparsity penalty, the objective
function in FDDL consists of two terms: a fidelity term and a discrimination
constraint term on the coefficients

$$J(D, A) = \min_{D,A}\ r(X, D, A) + \lambda_1 \|A\|_1 + \lambda_2 f(A), \tag{25}$$

where $r(X, D, A)$ is the fidelity term and $f(A)$ is the discrimination
constraint on the coefficients.

The fidelity term is defined in [7], for the samples of class $i$, as follows

$$r(X_i, D, A_i) = \|X_i - D A_i\|_F^2 + \|X_i - D_i A_i^i\|_F^2 + \sum_{j=1,\, j \neq i}^{c} \|D_j A_i^j\|_F^2, \tag{26}$$

where $D_i$ is the part of the dictionary associated with class $i$, and $A_i$
is the representation of $X_i$ over $D$. Also,
$A_i = [A_i^1; A_i^2; \ldots; A_i^c]$, where $A_i^j$ is the part of the
coefficients that represent $X_i$ over the subdictionary $D_j$. The overall
fidelity term $r(X, D, A)$ is obtained by summing (26) over the $c$ classes.
In (26), the first two terms indicate that the whole dictionary and also the
subdictionary associated with class $i$ should represent the data samples in
the same class $X_i$ well, whereas the last term indicates that the
subdictionaries from other classes should contribute little towards the
representation of the data samples in class $i$.
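The three parts of the fidelity term in (26) map directly onto array slices once each atom carries a class label. The sketch below evaluates the term for the samples of one class; the function name and the `atom_labels` argument are illustrative and not part of [7].

```python
import numpy as np

def fddl_fidelity(X_i, D, A_i, atom_labels, i):
    """Discriminative fidelity term of (26) for the samples X_i of class i.

    X_i: (p, n_i) samples, D: (p, k) dictionary, A_i: (k, n_i) codes of X_i,
    atom_labels: (k,) class index of every atom.
    """
    atom_labels = np.asarray(atom_labels)
    own = atom_labels == i
    whole = np.sum((X_i - D @ A_i) ** 2)                  # ||X_i - D A_i||_F^2
    own_fit = np.sum((X_i - D[:, own] @ A_i[own, :]) ** 2)
    others = 0.0
    for j in np.unique(atom_labels):
        if j != i:
            sel = atom_labels == j
            others += np.sum((D[:, sel] @ A_i[sel, :]) ** 2)
    return whole + own_fit + others
```

The term is zero only when the sample is reconstructed perfectly and exclusively by atoms of its own class; the same sample coded with another class's atoms is penalized by all three parts.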

The Fisher discrimination term, on the other hand, is as follows

$$f(A) = \mathrm{tr}(S_W(A)) - \mathrm{tr}(S_B(A)) + \eta \|A\|_F^2, \tag{27}$$

where tr is the trace operator, and $S_W$ and $S_B$ are the within- and
between-class covariance matrices, respectively. The last term is a penalty
added to (27) to make the optimization problem convex [7].
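Because only traces are needed in (27), the scatter matrices never have to be formed explicitly. The sketch below assumes the usual scatter definitions (sums of squared deviations from the class means and of the class means from the global mean, weighted by class size); the function name is illustrative.

```python
import numpy as np

def fisher_term(A, labels, eta):
    """f(A) = tr(S_W(A)) - tr(S_B(A)) + eta * ||A||_F^2 for codes A (k, n)."""
    labels = np.asarray(labels)
    m = A.mean(axis=1, keepdims=True)                 # global mean code
    tr_sw = 0.0
    tr_sb = 0.0
    for c in np.unique(labels):
        A_c = A[:, labels == c]
        m_c = A_c.mean(axis=1, keepdims=True)         # class mean code
        tr_sw += np.sum((A_c - m_c) ** 2)             # within-class scatter
        tr_sb += A_c.shape[1] * np.sum((m_c - m) ** 2)  # between-class scatter
    return tr_sw - tr_sb + eta * np.sum(A ** 2)
```

Codes that are tightly clustered within each class and far apart between classes give a strongly negative value when eta is small, which is what the minimization in (25) rewards.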

Discussion. The joint optimization problem proposed in (25), due to the
Fisher discrimination criterion on the coefficients and the discriminative
fidelity term on the dictionary, is not convex, and has to be solved
iteratively and alternately between these two terms until it converges.
However, there is no guarantee of finding the global minimum. Also, it is not
clear whether the improvement obtained in classification by including the
Fisher discriminant criterion on the coefficients justifies the additional
computational load imposed on the learning, as no comparison is provided in
[7] on the classification with and without supervision on the coefficients.

3.6. Learning a Histogram of Dictionary Elements over Signal Constituents

There are situations where a signal is made of some local constituents,
e.g., an image is made up of patches, or a speech signal consists of
phonemes. However, the ultimate classification task is to classify the signal,
not its individual local constituents, e.g., the whole image, not the patches
in the previous example. This classification task is usually tackled by
computing the histogram of dictionary atoms over the local constituents
of a signal. The computed histograms are used as the signature (model)
of the signal and are eventually used for training a classifier and
predicting the labels of unknown signals. Unlike the previous five categories,
the motivation of the approaches in the sixth S-DLSR category is to design
a supervised dictionary that is discriminative over the histogram
representation of signals, not over individual local descriptors [MS]. Hence,
these approaches cannot be used in cases where a signal does not consist of
a collection of local constituents. The main approaches in this category are:
1) texton-based method, 2) histogram computation using DLSR, 3) universal
and adapted vocabularies, and 4) supervised dictionary learning model
(SDLM). The following subsections provide the description of these methods.

3.6.1. Texton-based approach

The texton-based approach [20, 23, 50-53] is one of the earliest methods
proposed to compute the histogram of dictionary elements, called
textons, to model a texture image based on extracted patches. This approach
was proposed particularly for texture analysis, but is sufficiently general to
be used in other applications. In a texton-based approach, the first step is to
construct the dictionary. To this end, small-sized local patches are randomly
extracted from each texture image in the training set. These small patches
are then aggregated over all images in a class, and clustered using a clustering
algorithm such as k-means. The obtained cluster centers form a dictionary that
represents the class of textures used. In other words, supervised k-means is
used to compute the dictionary atoms [20, 23].

The next step is to find the features (learn the model) using the images
in the training set. To this end, small patches of the same size as in the
previous step are extracted by sliding a window over each training image in a
class. Then the distance between each patch and all textons in the dictionary
is computed to find the closest match using a distance measure such as the
Euclidean distance. Finally, a histogram of textons is updated accordingly
for each image based on the closest match found. This yields a histogram
for each image in the training set, which is used, after normalization, as the
features representing that image. Figure 2 illustrates the construction of the
dictionary and learning of the model in a texton-based system.

Figure 2: The illustration of the two steps of a texton-based system: (a) the
generation of the texton dictionary using supervised k-means and (b) the
generation of features by computing the texton histograms on an image (reused
from |T2j courtesy of Springer Science).
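The two steps of a texton-based system can be sketched in a few lines of NumPy. The sketch stores patches as rows (one vectorized patch per row), uses a plain Lloyd-style k-means written inline so that it is self-contained, and reads "supervised k-means" as clustering each class separately and concatenating the cluster centers. All function names are illustrative, and a real system would typically use an optimized clustering routine.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; points is (n_points, dim). Returns (k, dim)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):          # keep old center if cluster empties
                centers[j] = points[assign == j].mean(axis=0)
    return centers

def texton_dictionary(patches_by_class, k):
    """Cluster the patches of each class separately and stack the centers."""
    return np.vstack([kmeans(p, k) for p in patches_by_class])

def texton_histogram(patches, textons):
    """Normalized histogram of nearest textons for the patches of one image."""
    d = ((patches[:, None, :] - textons[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)               # closest texton per patch
    hist = np.bincount(nearest, minlength=len(textons)).astype(float)
    return hist / hist.sum()
```

The histogram returned for each training image is the feature vector that is subsequently passed to a classifier.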


3.6.2. Histogram computation using dictionary learning and sparse representation

In the texton-based approach, supervised k-means was used to compute
the dictionary. To compute the histogram of textons, each patch was
represented by the closest match in the dictionary. This is the maximum
sparsity possible, as each patch is represented by only one dictionary element.
However, as proposed in EH, it is possible to use (3) and one of the recent
algorithms for its implementation, such as online learning El, to compute
the dictionary and the corresponding sparse coefficients over the patches
extracted from an image. As in the texton-based approach, building the
dictionary and the histogram of dictionary elements can be done in two steps.

In the first step, random patches are extracted from each image in the
training set. Next, by submitting these patches to the online learning
algorithm, the dictionary can be computed GB.

In the second step, the model (feature set) for each image has to be found.
To this end, patches of the same size as those in the dictionary
learning step are extracted from each image. Let $x_i$ be the $i$-th image in
the training set. The signal constituents (i.e., patches) of $x_i$ can be
denoted as $X_i = [x_i^1, x_i^2, \ldots, x_i^m] \in \mathbb{R}^{t \times m}$,
where $m$ is the number of patches extracted, and each patch is of size
$\sqrt{t} \times \sqrt{t}$. Then, using (3), the corresponding coefficients
$A_i = [\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^m] \in \mathbb{R}^{k \times m}$
are computed ($k$ is the number of dictionary atoms).
For each patch $x_i^j$, most of the elements in the corresponding coefficient
$\alpha_i^j$ are zero. The nonzero elements in $\alpha_i^j$ determine the atoms
in the dictionary $D$ that contribute towards the representation of the patch
$x_i^j$. If all these coefficients are summed up over all patches extracted
from an image, one effectively obtains the histogram of primitive elements
contributing towards the representation of this particular image, i.e.,

$$\mathbf{h}_i = \sum_{j=1}^{m} \alpha_i^j. \tag{28}$$

A histogram $\mathbf{h}$ with non-negative values in all bins can be obtained
by imposing a positivity constraint on $\alpha_i^j$ in (3). The positivity
constraint also prevents the contributions of different patches from canceling
each other when they are summed up in (28).

Figure 3: Taxonomy of dictionary learning and sparse representation as
presented in this paper. Supervised dictionary learning and sparse
representation (S-DLSR) approaches are divided into six categories.

In this way, while in a texton-based approach each patch is represented
using only the closest texton in the dictionary, here each patch is represented
using several primitive elements in the dictionary, which can potentially
provide a richer representation than the texton-based approach. The
number of nonzero elements in $\alpha_i^j$, and consequently in $A_i$, can be
controlled using $\lambda$, which is the sparsity parameter in (3), i.e.,
larger values of $\lambda$ yield sparser coefficients m-
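The histogram in (28) with a positivity constraint on the codes can be sketched as follows. The sparse coding step is solved here with a simple projected proximal-gradient loop, which is a stand-in chosen for brevity and not the online learning algorithm mentioned above; the lasso is assumed to carry a factor of 1/2 on the reconstruction term, and the function names are illustrative.

```python
import numpy as np

def nonneg_sparse_codes(X, D, lam, n_iter=200):
    """Minimize 0.5 * ||X - D A||_F^2 + lam * sum(A) subject to A >= 0.

    X: (t, m) patches as columns, D: (t, k) dictionary. Returns A of shape (k, m).
    """
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)     # 1 / Lipschitz constant
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X)                 # gradient of the quadratic term
        # Gradient step, shrinkage by lam, and projection onto A >= 0.
        A = np.maximum(A - step * (grad + lam), 0.0)
    return A

def sparse_histogram(X, D, lam):
    """Histogram of (28): sum of the non-negative codes over all patches."""
    return nonneg_sparse_codes(X, D, lam).sum(axis=1)
```

Increasing `lam` drives more coefficients to exactly zero, so fewer atoms contribute to each patch and the histogram approaches the one-atom-per-patch behavior of the texton-based approach.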

3.6.3. Universal and adapted vocabularies (UAV) 

Although the above two approaches include the category information in
the learning of individual dictionary atoms, they do not include the class
labels in the learning of the histograms. However, the main goal is to
make the histogram discriminative, not the individual dictionary elements, as
the ultimate goal is to classify the signal, not its constituents. For example,
a white patch may appear in outdoor scenes as part of a cloud in the sky as
well as in indoor scenes as the color of a kitchen ceiling. However, the main
goal is to classify the scenes as indoor or outdoor, and hence putting effort
into making the individual dictionary elements discriminative might be
misleading (a white dictionary atom may appear in both outdoor and indoor
scenes in the previous example). To address this kind of problem, Perronnin
has proposed in [81] to learn one bipartite histogram per class for each image.
Each bipartite histogram, as the name implies, has two parts: a part adapted
to the specific class, and a universal part. In each histogram, ideally, if the
object belongs to the class, its adapted part is more significant than the
universal one; otherwise the universal part is more dominant.

Gaussian mixture models (GMMs) are used to learn the universal
vocabularies (dictionaries) using maximum likelihood estimation (MLE) for
low-level local descriptors such as scale-invariant feature transform (SIFT)
descriptors. Then class-specific vocabularies are adapted using the maximum a
posteriori (MAP) criterion. Eventually, the bipartite histograms are estimated
using the adapted and universal vocabularies [81].

3.6.4. Supervised dictionary learning model (SDLM)

A supervised dictionary learning model (SDLM) is proposed in [75], which
combines an unsupervised model based on a Gaussian mixture model (GMM)
with a supervised model, i.e., a logistic regression model, in a probabilistic
framework. As explained at the beginning of this subsection, the motivation
of this model is to learn the dictionary such that the histogram
representations of images are sufficiently discriminative over different
classes. Intuitively, in SDLM, a logistic loss function is used to pass the
discriminative information in the class labels to the histogram features. This
information is subsequently passed to the dictionary learned over image local
features by affecting the GMM parameters eg.

Discussion. As mentioned earlier, the approaches in this category are mainly
designed to classify signals using the histogram of dictionary atoms built on
the signal constituents. Since the ultimate goal is to classify the signals
per se, not their constituents, it is reasonable to make the histograms
discriminative, not necessarily the individual dictionary atoms. The last two
approaches in this category, i.e., UAV and SDLM, place more emphasis on this
attribute and propose methods to include category information in the learning
process such that the histogram of dictionary atoms represents the signals in
the most discriminative way possible.

Figure 3 summarizes the taxonomy of dictionary learning and sparse
representation techniques as presented in this paper for quick reference.

4. Summary and Guidelines for Practitioners 

This section summarizes the methods presented in Section 3 from the
perspective of how the different building blocks of an S-DLSR algorithm
are represented and learned. This perspective allows readers who
are interested in building an S-DLSR solution to evaluate different design
decisions and select the best practices for the problem at hand.

There are three building blocks in an S-DLSR method: (1) the dictionary
D, (2) the coefficients A, and (3) the classifier parameters W. The methods
presented in Section 3 vary in how they represent and learn these blocks.
In the rest of this section, we summarize the different design decisions for
the representation and learning of each of these blocks and comment on the
advantages and shortcomings of each.

4.1. The dictionary D

The dictionary $D \in \mathbb{R}^{p \times k}$ consists of a set of $k$ atoms.
Each atom $d_j$ is a $p$-dimensional vector, which can be represented as:

i) a single data instance, i.e., $d_i = x_i$. This representation was used by the
seminal work on sparse representation-based classification (SRC) [6]. If the
number of training samples is manageable, this representation is very efficient,
as no overhead is needed to form the dictionary atoms. In addition,
when the classification is based on the minimum residual error
given in (7), this representation is easily interpretable: the system user
can inspect the dictionary elements that yield the minimum residual error
and understand how the classification decision was made. On the other
hand, this approach is sensitive to noisy training instances, as the dictionary
atoms are the training samples themselves. Moreover, if the number of training
samples is large, this representation is inefficient, as decoding new signals
becomes computationally expensive. It also becomes infeasible to store and
transfer a large dictionary of instances, especially when the recognition system
is to be deployed on modest hardware such as that of portable devices. One
possible remedy is to reduce the size of the dictionary by selecting a
representative subset of the dictionary atoms. This can be done by representing
atoms (i.e., columns of D) as vectors in the feature space and selecting a subset
of these vectors such that the reconstruction error of the other atoms (or of the
matrix D) based on the selected vectors is minimized. This problem is formally
known as column subset selection (CSS), and several research efforts
have been devoted to solving it [31 1821 - iM].

ii) a function of multiple data instances. This representation was used by the
metaface method as well as its extensions [55]. The basic idea here is to
learn a small dictionary in which each atom is a linear/nonlinear combination
of many data instances. The main advantage of this representation is the
simplicity of the approaches used to learn the dictionary. However, since
the size of the dictionary increases linearly with the number of classes, these
approaches may lead to large dictionaries when there are many classes.
In addition, if the dictionary atoms are dense combinations of many data
instances, it will be difficult to interpret their meaning and to reason about
the different classification decisions.


iii) a function of signal constituents. An example of this representation is the
texton-based approach (Subsection 3.6.1), where each atom is an average-like
function of some constituents of the original signal. This representation is
useful for problems where signals are known to be constructed of some
constituents, such as patches of natural scenes and texture images. The
challenging task associated with this representation is the learning of a
discriminative dictionary using the signal constituents. For some problems, like
texture analysis, this task has been extensively studied, and the clustering of
many candidate constituents has been considered one of the effective methods
to learn the dictionary. Moreover, some signal decomposition algorithms like
non-negative matrix factorization (NNMF) are known to decompose signals
into their parts [33, 34] and can accordingly be used to learn the dictionary
in this case.
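
To make the first representation concrete, the following is a minimal sketch of classification with a dictionary whose atoms are the training instances themselves, followed by a minimum-residual decision in the spirit of SRC. The greedy orthogonal matching pursuit used for the sparse coding step, and the names `omp`, `src_predict`, and `atom_labels`, are assumptions made for illustration; they are not taken from the reviewed methods, which may use a different sparse solver.

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy orthogonal matching pursuit (illustrative sparse coder).

    D: p x k dictionary with unit-norm columns, x: p-vector.
    Returns a k-vector with at most n_nonzero non-zero entries.
    """
    coef = np.zeros(D.shape[1])
    support = []
    residual = x.astype(float)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        coef[:] = 0.0
        coef[support] = sol
        residual = x - D @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    return coef

def src_predict(D, atom_labels, x, n_nonzero=5):
    """Assign x to the class whose atoms alone reconstruct it best."""
    alpha = omp(D, x, n_nonzero)
    classes = np.unique(atom_labels)
    residuals = []
    for c in classes:
        # Keep only the coefficients that belong to atoms of class c.
        alpha_c = np.where(atom_labels == c, alpha, 0.0)
        residuals.append(np.linalg.norm(x - D @ alpha_c))
    return classes[int(np.argmin(residuals))]
```

Because each atom is a training sample, the atoms with non-zero coefficients in the winning class can be inspected directly, which is the interpretability advantage noted above. The cost of the correlation step grows with the number of atoms, which reflects the scalability concern for large training sets.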

From the learning perspective, the dictionary D is learned: 


i) per class (Subsection 3.1). While the methods in this category are
simpler and computationally efficient, they suffer from two main shortcomings:
(1) there might be redundancy between the atoms in the learned
dictionaries of different classes (e.g., a signal constituent that is common to
more than one class), and (2) the methods can easily ignore very descriptive
atoms that are functions of data instances from different classes (e.g., a
metaface that combines positive features from one class and negative features
from the other).


ii) unsupervised learning with supervised pruning (Subsection 3.2). In this
approach, a large dictionary is first learned without supervision. The class
labels are only included in the pruning step. The initial large dictionary and
the subsequent pruning step, however, increase the computational complexity.
Moreover, the ultimate discrimination power of the pruned dictionary is always
less than that of the initial large one; therefore, the highest achievable
classification performance depends on the discrimination power of the initial
large dictionary.


iii) using all class labels (Subsections 3.3 and 3.4). This category of methods
solves a usually complex optimization problem that maximizes the descriptiveness
of the atoms while minimizing their redundancy. Most of these methods,
however, are computationally demanding, as the optimization problem has to
be solved iteratively and alternately among the dictionary, coefficients, and
even classifier parameters.
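
As an illustration of the simplest of these options, per-class learning, the sketch below learns one small sub-dictionary per class and concatenates the results. Plain k-means is used here only as a stand-in for whichever per-class learner is chosen; the function names and the `atoms_per_class` parameter are hypothetical and not part of any reviewed method.

```python
import numpy as np

def kmeans_atoms(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means on the columns of X (p x n); returns p x k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[:, rng.choice(X.shape[1], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances between every sample and every centroid: (n, k).
        d2 = ((X[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = X[:, assign == j]
            if members.shape[1] > 0:  # keep the old centroid if a cluster empties
                centroids[:, j] = members.mean(axis=1)
    return centroids

def learn_per_class_dictionary(X, y, atoms_per_class):
    """Learn one sub-dictionary per class and concatenate them column-wise."""
    blocks, atom_labels = [], []
    for c in np.unique(y):
        # Each class only ever sees its own samples.
        blocks.append(kmeans_atoms(X[:, y == c], atoms_per_class))
        atom_labels.extend([c] * atoms_per_class)
    return np.hstack(blocks), np.array(atom_labels)
```

The resulting dictionary has `atoms_per_class` times the number of classes columns, which makes the linear growth with the number of classes explicit. Since each sub-dictionary is learned in isolation, nothing prevents two classes from producing nearly identical atoms, which is the redundancy shortcoming described above.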


4.2. The coefficients A

Each data instance $x_i$ can be represented in terms of dictionary atoms
using the coefficient vector $\alpha_i$. These coefficients are usually learned
such that the reconstruction error of the original data instance (or its parts)
using the dictionary atoms is minimized, as can be seen from (3).
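
For a fixed dictionary, this minimization can be sketched with the iterative shrinkage-thresholding algorithm (ISTA). The sketch assumes the common $\ell_1$-penalized objective $\frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$; the exact formulation in the referenced equation may differ, and the function names and default parameters below are illustrative only.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(D, x, lam=0.1, n_iter=200):
    """Minimize 0.5 * ||x - D a||_2^2 + lam * ||a||_1 over a, with D fixed."""
    # Lipschitz constant of the gradient of the quadratic term.
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)          # gradient of the reconstruction term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```

Larger values of `lam` drive more coefficients exactly to zero, which trades reconstruction accuracy for sparsity.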

From the representation perspective, a coefficient vector for a data
instance represents:

i) a linear combination of atoms. This is the most common representation of
the coefficient vector. Given a data instance and the dictionary atoms, the
data instance is usually represented as a sparse combination of dictionary
atoms. This representation is suitable for cases where the dictionary
atoms can be used to reconstruct the original data instance.


ii) a histogram over atoms. This representation is used when the dictionary
atoms represent constituents of the data instances (Subsection 3.6). In this
case, the constituents of the new data instance are first selected, and then
the coefficient vector is represented as a histogram over the closest atoms for
these constituents in the dictionary.
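
The histogram representation can be sketched as follows, assuming Euclidean distance for the nearest-atom assignment and an optional normalization to unit sum. Both choices, as well as the function name, are assumptions for illustration; individual methods may use soft assignments or other distances.

```python
import numpy as np

def histogram_over_atoms(D, constituents, normalize=True):
    """Encode one signal as a histogram over its nearest dictionary atoms.

    D: p x k dictionary, constituents: p x m local features of the signal.
    """
    # Squared Euclidean distance from each constituent to each atom: (m, k).
    d2 = ((constituents[:, :, None] - D[:, None, :]) ** 2).sum(axis=0)
    nearest = d2.argmin(axis=1)
    # Count how many constituents fall on each atom.
    hist = np.bincount(nearest, minlength=D.shape[1]).astype(float)
    if normalize and hist.sum() > 0:
        hist /= hist.sum()
    return hist
```

Unlike a linear combination of atoms, this encoding does not allow the signal to be reconstructed; it only records how often each atom is the best match for a constituent.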

From the learning perspective, the coefficients A are learned: 


i) for test samples only. Some S-DLSR algorithms do not require the
learning of coefficients for the training instances. Instead, the training data
are used for constructing (or learning) the dictionary, and the coefficients
are then learned only for the new test samples. These methods usually use a
simple classification model over the learned dictionary atoms (like nearest
neighbour classifiers or the minimum residual classifier). An example of these
algorithms is sparse representation-based classification (SRC) [6].


ii) for training and test samples. When a complex classification model (like
SVM) needs to be learned, the coefficient matrix corresponding to the training
samples, $A \in \mathbb{R}^{k \times n}$, is also needed. These coefficients
represent the training instances in the space of dictionary atoms. The
coefficients can be learned separately after the dictionary is learned, when the
dictionary has a closed-form solution, such as in the HSIC-based S-DLSR, or
simultaneously with the dictionary, which is the case in most S-DLSR methods,
where there is no closed-form solution for the dictionary. Moreover, the
category information can be used to learn more discriminative coefficients (see
Subsection 3.5).


4.3. The classification model W

In S-DLSR methods, the classifier receives as an input the encoding of a 
new data instance in the space of dictionary atoms and returns the encoding 
of the data instance in the space of classes. 

From the representation perspective, the classifier can be: 

i) a binary map from atoms to classes. This is the simplest representation
used by the S-DLSR methods, in which each group of dictionary atoms maps
to a separate class. A new data instance is first mapped to the space of
atoms, and then a simple classification rule is employed to assign this new
instance to one of the classes. This approach is computationally efficient,
and it is easy to interpret the classification decisions by inspecting the atoms
of the assigned class. However, this simple classification rule cannot handle
complex class assignments where data points are not directly mapped to the
atoms of a single class.


ii) a linear map from atoms to classes. For linear classifiers, the classifier
parameters W form a mapping from the space of dictionary atoms to that
of the classes. In this case, the coefficient matrix A needs to be learned for
the training data, and then a linear classification model is learned over these
coefficients. In some algorithms, the learning of the classifier is done
simultaneously with the learning of the dictionary (see Subsection 3.3). This
approach is more complex than the first one but usually results in better
classification decisions.


iii) a non-linear map from atoms to classes. When the data instances in
the space of atoms are not linearly separable, one might consider learning a
nonlinear classifier (such as SVM with an RBF kernel) over the coefficient
matrix A. The use of nonlinear classifiers, however, makes it more
computationally difficult to simultaneously learn the dictionary and/or
coefficients with the classification models.
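
As a concrete instance of the linear map, the sketch below fits W by ridge regression of one-hot class indicators on the training coefficients, i.e., $W = Y A^\top (A A^\top + \rho I)^{-1}$. This particular estimator, the one-hot label matrix Y, the regularization weight `reg`, and the function names are assumptions chosen for brevity; any linear classifier trained on A would fit the same role.

```python
import numpy as np

def fit_linear_classifier(A, y, n_classes, reg=1e-3):
    """Fit a c x k linear map W from coefficients (k x n) to class scores.

    A: k x n training coefficients, y: length-n integer labels in [0, n_classes).
    """
    Y = np.eye(n_classes)[:, y]                  # c x n one-hot label matrix
    k = A.shape[0]
    # Solve (A A^T + reg I) W^T = A Y^T instead of forming an explicit inverse.
    W = np.linalg.solve(A @ A.T + reg * np.eye(k), A @ Y.T).T
    return W

def predict(W, alpha):
    """Return the class whose score W @ alpha is largest."""
    return int(np.argmax(W @ alpha))
```

The binary map of item i) corresponds to the special case where each row of W is a fixed 0/1 indicator of the atoms belonging to one class, so nothing has to be learned for W.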

From the learning perspective, a classification model is learned: 

i) separately, after learning D and A. Given the representation of the data in
the space of atoms, traditional algorithms for supervised learning can be used
for learning a classification model for the problem at hand. These methods
are simpler, and they allow the different existing algorithms to be used
with the learned dictionary. On the other hand, learning the coefficients
in isolation from the classification model might result in a representation of
the data instances that does not necessarily capture the separation between
different classes.

ii) while learning D and/or A. This approach is more computationally
demanding than the first category, but it allows for the learning of a
representation in which data instances from different classes are well separated
in the space of dictionary atoms. This can potentially result in better
classification decisions. However, the joint optimization problem obtained for
learning both the dictionary and the classifier parameters is non-convex and has
to be solved iteratively and alternately. Such a non-convex problem may converge
to a local minimum, i.e., a sub-optimal solution. Moreover, due to
the complexity of the joint optimization problem, linear classifiers are mainly
used, and they may not perform adequately in more subtle classification
tasks [25].
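
A generic form of such a joint problem, written here under assumed notation rather than quoted from any single method, is shown below. $X$ denotes the training data, $Y$ a matrix of class indicators, and $\gamma$, $\lambda$ trade-off weights; none of these symbols is defined in this section, and individual methods in Subsection 3.3 differ in the exact penalties and constraints they use.

```latex
\min_{D,\,A,\,W} \;
  \underbrace{\|X - DA\|_F^2}_{\text{reconstruction}}
  \;+\; \gamma \underbrace{\|Y - WA\|_F^2}_{\text{classification}}
  \;+\; \lambda \sum_{i=1}^{n} \|\alpha_i\|_1
```

With $D$ and $W$ fixed, the problem in $A$ is convex, and with $A$ fixed, the problem in $(D, W)$ is a convex least-squares problem. The objective is not jointly convex in all three blocks, however, which is why alternating updates are used and why only convergence to a local minimum can be expected.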

Tables 2 and 3 provide a summary of the discussion provided in this section
for the three building blocks of an S-DLSR method from the representation
and learning perspectives, respectively.

5. Conclusion 

Supervised dictionary learning and sparse representation (S-DLSR) is an
emerging category of methods that yield dictionaries and sparse
representations better suited to classification tasks. In this paper, we
surveyed the state-of-the-art techniques for S-DLSR and presented a
comprehensive view of these techniques. We have identified six main categories
of S-DLSR methods and highlighted the advantages and shortcomings of the
methods in each
Table 2: The representation of different components of an S-DLSR solution.

| Component | Representation | Summary |
|---|---|---|
| Dictionary D (each atom can be represented as:) | a single data instance | easy to interpret, efficient if training data is small; sensitive to noisy training instances, inefficient and infeasible to store and transfer if training data is large |
| | a function of multiple data instances | smaller dictionary, simple learning algorithms; more difficult to interpret, size increases with the number of classes |
| | a function of signal constituents | smaller dictionary, suitable when signals are constructed of some constituents; more difficult to interpret |
| Coefficients A (a coefficient vector for a data instance can be represented as:) | a linear combination of atoms | suitable when atoms can reconstruct the signals |
| | a histogram over atoms | suitable when atoms represent constituents of signals |
| Classifier W (a classification model can be represented as:) | a binary map from atoms to classes | computationally efficient and easy to interpret; cannot handle complex class assignment |
| | a linear map from atoms to classes | more accurate; more computationally complex |
| | a non-linear map from atoms to classes | suitable for complex data with non-linear classes; computationally infeasible to learn simultaneously with D and A |

Table 3: The learning of different components of an S-DLSR solution.

| Component | Learning | Summary |
|---|---|---|
| Dictionary D | per class | simple and computationally efficient; redundancy among atoms, prone to ignoring descriptive atoms |
| | unsupervised learning with supervised pruning | less redundancy among atoms; computationally complex |
| | using all class labels | more optimal in terms of redundancy and descriptiveness; complex optimization, computationally demanding |
| Coefficients A | for test samples only | simple classification model; less accurate |
| | for training and test samples | more accurate for complex data; more computationally demanding |
| Classifier W | separately, after learning D and A | simpler, usable with different DL algorithms; less separation between classes |
| | while learning D and A | more classification accuracy for linear classifiers; computationally complex, sub-optimal solutions |


category. Furthermore, we have provided a summary of the building blocks
of an S-DLSR method, including the dictionary, sparse coefficients, and
classifier parameters, from two perspectives: representation and learning. This
enables researchers to decide how to choose these blocks when designing a
new S-DLSR algorithm for the problem at hand. This review addresses
a gap in the literature and is anticipated to advance research in S-DLSR
and its applicability to a variety of domains.